feat(upload): idempotent retry of the last committed chunk#145

Merged
rubenhensen merged 2 commits into main from feat/idempotent-chunk-retry
May 7, 2026

Conversation

@rubenhensen
Contributor

Summary

When a chunk PUT response is lost in flight, the client retains the previous CryptifyToken (the one it sent), but the server has already advanced the rolling-token chain. Without server-side help, retry is impossible: the naïve resend hits "token mismatch" or "wrong offset" depending on what the client tries.

This PR adds the missing protocol piece — server-side detection of duplicate retries — so client-side retry (coming next in postguard-js) can actually work.

How it works

The server now caches the most recently committed chunk's (prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token) on the FileState (in-memory for now; will migrate to Postgres in Phase 2). The chunk handler is restructured around a classify_chunk_request helper with three outcomes:

  • NormalNext — state.uploaded == start && header_token == current_token. Write, advance, refresh last_chunk.
  • ReplayLastChunk — every cached field matches AND the body's SHA-256 matches the stored hash. Return the cached response_token, no write, no double-count.
  • Reject — regular 400s. Stricter 400s when the request looks like a retry (matching prev_token + offset) but the body or length differ — never accept different bytes for the same offset.

The body is now read once at the top of the handler so the same bytes feed both the replay-detection hash and the normal write path. The per-upload limit and file-open checks still run only on the normal-write path; on a retry the original PUT already passed them.
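The three outcomes above can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: `classify_chunk_request`, `FileState`, and the cached-field names come from the description, but the exact types and signatures here are assumptions for illustration.

```rust
#[derive(Debug, PartialEq)]
enum ChunkClass {
    NormalNext,
    ReplayLastChunk { cached_response_token: String },
    Reject { msg: &'static str },
}

// Cached record of the most recently committed chunk (assumed shape).
struct LastChunk {
    prev_token: String,
    prev_uploaded: u64,
    chunk_len: u64,
    chunk_sha256: [u8; 32],
    response_token: String,
}

struct FileState {
    uploaded: u64,
    cryptify_token: String,
    last_chunk: Option<LastChunk>,
}

fn classify_chunk_request(
    state: &FileState,
    header_token: &str,
    start: u64,
    body_len: u64,
    body_sha256: [u8; 32],
) -> ChunkClass {
    // Normal next chunk: correct offset and the current rolling token.
    if state.uploaded == start && header_token == state.cryptify_token {
        return ChunkClass::NormalNext;
    }
    if let Some(last) = &state.last_chunk {
        if header_token == last.prev_token && start == last.prev_uploaded {
            // Looks like a retry of the last committed chunk: replay only
            // if the body is provably byte-identical to what was committed.
            if body_len == last.chunk_len && body_sha256 == last.chunk_sha256 {
                return ChunkClass::ReplayLastChunk {
                    cached_response_token: last.response_token.clone(),
                };
            }
            // Same (prev_token, offset) but different bytes: strict reject.
            return ChunkClass::Reject {
                msg: "retry body differs from committed chunk",
            };
        }
    }
    ChunkClass::Reject {
        msg: "Content-Range or token mismatch",
    }
}
```

Note how the strict reject sits inside the replay-candidate branch: once a request matches the cached `(prev_token, prev_uploaded)` pair, different bytes for the same offset can never be accepted.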

Wire compatibility

Existing clients that don't retry see no change. The contract only kicks in when a client sends (prev_token, prev_uploaded) matching the cached record — old clients never do. Documented in api-description.yaml so future clients (postguard-js v1.4 in particular) can rely on it.

Test plan

Seven new unit tests cover the classifier:

  • classify_normal_next_chunk — happy path
  • classify_replays_last_chunk_on_matching_retry — full match returns cached token
  • classify_rejects_retry_with_different_body — same prev_token + offset, body bytes differ → 400
  • classify_rejects_retry_with_different_length — same prev_token + offset, length differs → 400
  • classify_rejects_offset_mismatch_with_no_replay — no last_chunk cached → regular 400
  • classify_rejects_token_mismatch_at_correct_offset — right offset, wrong token, no replay match → 400 with the existing token-mismatch message
  • classify_does_not_replay_when_prev_token_does_not_match — last_chunk cached but a different prev_token → falls through to regular Content-Range 400

CI gate:

  • cargo fmt --check passes
  • cargo clippy --all-targets -- -D warnings passes
  • cargo test --all-targets passes (49 tests, none regressed)

Refs

The deliberately-deferred status-endpoint design from #136 (cross-refresh resume) will be filed as a separate issue once Phase 2 (Postgres-backed sessions) is in place to support it.

When a chunk PUT response is lost in flight the client retains the old
CryptifyToken (the one it sent) but the server has already advanced the
rolling-token chain. Without server-side help, retry is impossible: the
naïve resend hits 'token mismatch' or 'wrong offset' depending on what
the client tries.

Server now caches the just-committed chunk's
(prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token) on
the FileState. The chunk handler is restructured around a
`classify_chunk_request` helper with three outcomes:

- NormalNext: state.uploaded == start && header_token == current_token
  → write, advance, refresh last_chunk
- ReplayLastChunk: every cached field matches AND body sha256 matches
  → return cached response_token, no write, no double-count
- Reject: regular 400s. Stricter 400s when the request looks like a
  retry but the body or length differ — never accept different bytes
  for the same offset.

The body is now read once at the top of the handler so the same bytes
feed both the replay-detection hash and the normal write path. The
per-upload limit and file-open are still checked only on the
normal-write path; on retry the original PUT already passed those.

api-description.yaml documents the contract on PUT /fileupload/{uuid}.

Tests: seven new unit tests cover NormalNext, ReplayLastChunk match,
length mismatch, body-hash mismatch, offset mismatch with no replay
record, token mismatch at correct offset, and prev_token mismatch with
a replay record present (falls through to regular Content-Range 400).

Refs #136 (resume protocol design from
@dobby-coder's audit), encryption4all/postguard-website#117.
@rubenhensen
Contributor Author

/dobby review this

@dobby-coder
Contributor

dobby-coder Bot commented May 7, 2026

Dobby is here, Master Ruben! Dobby reads the request and Dobby will review the chunk-retry protocol with great care! Dobby will check the classifier logic, the tests, and the wire-compat story — then Dobby will leave a proper review!

@dobby-coder dobby-coder Bot left a comment

Review

Solid PR. The classifier abstraction reads well, the test coverage is thorough (especially the negative cases), and the wire-compat story checks out — old clients never present a (prev_token, prev_uploaded) pair that matches a cached last_chunk, so they fall straight through to the existing paths.

I built and ran the suite locally on feat/idempotent-chunk-retry: cargo test --all-targets → 49 passed, 0 failed. CI's Rust-quality job is green; the amd64/arm64 image builds were still pending at review time.

Things I'd consider before merge

  1. Body is now read before the per-upload-limit and token checks. Pre-PR ordering rejected a request with the wrong cryptifytoken or wrong offset before consuming any body. Now the body — up to chunk_size (~1 MiB) — is always read into memory first so the classifier can hash it. UUIDs are unguessable, so this is narrow, but it does slightly widen the work an attacker with a leaked UUID can force per request. Probably acceptable; flagging in case you want to gate the body read behind a cheap pre-check (e.g., only read the body when state.uploaded == start or last_chunk.prev_uploaded == start).

  2. chunk_sha256 is arguably redundant. compute_hash(last.prev_token, body) == last.response_token already proves the body is byte-identical (modulo a SHA-256 collision, which would also break the rolling-token chain itself). You could drop chunk_sha256 (and chunk_len, since body.len() is implicit in the hash input) from LastChunkRecord and verify replay by recomputing the rolling hash. Two upsides: smaller FileState, and one less invariant to keep in sync when Phase 2 moves this to Postgres. Downside: the current explicit fields are more obvious to a reader. Judgment call — not a blocker.
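Point 2 can be illustrated with a small sketch. `compute_hash` below is a stand-in built on std's `DefaultHasher` rather than the real SHA-256 rolling hash, but the construction is the same: hash over `(prev_token, body)` and compare the result to the cached `response_token`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the SHA-256-based rolling hash: hashes prev_token
// together with the body bytes. Only the construction matters here.
fn compute_hash(prev_token: u64, body: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    prev_token.hash(&mut h);
    body.hash(&mut h);
    h.finish()
}

// Replay check without chunk_sha256 or chunk_len: recompute the rolling
// hash over the retried body and compare it to the cached response_token.
// A different body, or a different length, yields a different hash.
fn is_replay(cached_prev_token: u64, cached_response_token: u64, body: &[u8]) -> bool {
    compute_hash(cached_prev_token, body) == cached_response_token
}
```

A length mismatch surfaces the same way as a content mismatch, which is why the dedicated length field (and its test) can be dropped.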

  3. Duplicated literal "Cryptify Token header does not match" in the classifier (src/main.rs:615) vs. check_cryptify_token (still used by finalize). The comment notes the wording is intentional; a const TOKEN_MISMATCH_MSG: &str referenced by both call sites would prevent silent drift if the message ever changes.

  4. Minor: two shasum.clone() calls (src/main.rs:546 and :553) where one would do — the final CryptifyToken(shasum) could take the moved value and state.cryptify_token / response_token could share a single clone.

  5. Phase 2 note (FYI, not for this PR): when last_chunk migrates to Postgres, response_token becomes a row that must be written atomically with the file write + state.uploaded advance, otherwise a server crash between "write file" and "persist replay record" reintroduces exactly the scenario this PR fixes. Worth keeping in mind for the migration design — probably wants the chunk write to be idempotent at the storage layer too, not just the response-replay layer.

Nothing here is blocking. Approve from this agent's side once CI's image builds finish.

Four small follow-ups from @dobby-coder's review:

1. Cheap pre-check before reading the chunk body. Mirrors the structural
   part of `classify_chunk_request` (normal-next OR replay-candidate)
   and rejects mismatches before consuming up to `chunk_size` bytes,
   restoring the prior security profile against UUIDs leaked to an
   attacker.

2. Drop `chunk_sha256` and `chunk_len` from `LastChunkRecord`. Body
   identity falls out of the rolling-token chain itself: recompute
   `compute_hash(prev_token, body)` and compare to `response_token`.
   Length divergence surfaces as a hash mismatch — one less invariant
   to keep in sync, smaller `FileState`, simpler Phase 2 schema. The
   length-specific test folds into the body-mismatch test.

3. Promote the duplicated "Cryptify Token header does not match"
   string to a `TOKEN_MISMATCH_MSG` const referenced by both
   `check_cryptify_token` (finalize path) and the chunk classifier so
   the wording can't drift silently between call sites.

4. One fewer `shasum.clone()` — the final responder takes the moved
   value while `state.cryptify_token` and `last_chunk.response_token`
   share a single clone.

`sha256_of` is removed (sole caller is gone). Tests gain a
`last_chunk_for(prev_token, offset, body)` helper that builds a
`LastChunkRecord` whose `response_token` is the actual rolling hash,
matching the production handler's invariant.

Refs #145 review.
@rubenhensen
Contributor Author

Thanks @dobby-coder. Pushed 2a6dac1 addressing #1–#4:

  1. DoS gate before body read — handler now does the cheap structural check (normal-next OR replay-candidate) before consuming the body, restoring the prior security profile. Reuses the same predicates classify_chunk_request does so the two stay in sync.
  2. Dropped chunk_sha256 and chunk_len — replay verification is now compute_hash(prev_token, body) == response_token, which is the same construction the rolling-token chain itself relies on. Length divergence falls out as a hash mismatch. The dedicated length test folds into the body-mismatch test.
  3. TOKEN_MISMATCH_MSG const — referenced by both check_cryptify_token (finalize) and the classifier.
  4. One fewer shasum.clone() — the final responder takes the moved value.
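The "cheap structural check before body read" from item 1 might look roughly like this. The state shape is an assumption (a minimal stand-in for the relevant parts of FileState), but the predicates mirror the two accepting branches of the classifier.

```rust
// Minimal stand-in for the parts of FileState the pre-check needs.
struct PreCheckState {
    uploaded: u64,
    current_token: String,
    last_prev_token: Option<String>,
    last_prev_uploaded: Option<u64>,
}

// Evaluated before any body bytes are consumed: the request must be
// either the normal next chunk or a candidate replay of the last
// committed one. Everything else is rejected without reading (and
// hashing) up to chunk_size bytes of body.
fn may_read_body(state: &PreCheckState, header_token: &str, start: u64) -> bool {
    let normal_next = state.uploaded == start && header_token == state.current_token;
    let replay_candidate = state.last_prev_token.as_deref() == Some(header_token)
        && state.last_prev_uploaded == Some(start);
    normal_next || replay_candidate
}
```

Because the gate reuses the same predicates the classifier evaluates, a request that passes it still goes through the full (hash-checked) classification afterwards; the gate only restores the pre-PR property that structurally invalid requests cost no body read.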

Skipping #5 here since it's a Phase 2 concern. Filing it as a tracking note on the Postgres-sessions design (the persist-and-advance ordering will need to be transactional with the file write to keep the same guarantees).

49 tests pass, fmt + clippy clean.

@rubenhensen rubenhensen merged commit 13661c7 into main May 7, 2026
5 checks passed
@rubenhensen rubenhensen deleted the feat/idempotent-chunk-retry branch May 7, 2026 11:33
@github-actions github-actions Bot mentioned this pull request May 7, 2026
rubenhensen added a commit to encryption4all/postguard-website that referenced this pull request May 7, 2026
Bumps `@e4a/pg-js` from `^1.3.0` to `^1.4.0` to pick up the new
upload/download retry machinery (chunk PUTs and download GETs retry
with exponential backoff + full jitter on 5xx and network errors).
The bump itself is a clean semver minor — additive config, no API
changes for the website.

UX additions on top of the bump:

- **Retry indicator.** `src/lib/postguard.ts` now wires an `onRetry`
  callback into the SDK that updates a small Svelte store
  (`retryStatus`). Both `SendButton.svelte` (upload) and
  `download/+page.svelte` subscribe to it and render a single line
  ("Connection hiccup, retrying… (attempt N of M)") underneath the
  active spinner during the retry window. Cleared on success or
  terminal error so a stale event can't leak between operations.

- **`UploadSessionExpiredError` distinguished from generic 5xx.** The
  SDK now raises this dedicated error when cryptify reports
  `upload_session_not_found` (idle past the configured TTL or the
  server restarted). On upload it surfaces a regular Error state
  with `serverError = false` so the existing copy doesn't blame the
  server for a state-loss problem retry can't fix; on download it
  gets its own state and copy ("Upload session expired" / "Ask the
  sender to send the files once more"). Localised in en + nl.

Also unblocks the husky pre-commit hook: the repo had
`prettier-plugin-svelte` as a devDependency but never registered it
in the prettier config, so `prettier --write` on staged paths failed
on .svelte files (see #143's bypass note). The plugin is now wired
up via the `plugins` field in the package.json `prettier` block, and
all 41 .svelte files were reformatted to apply the resulting style
(mostly stray semicolons + minor whitespace). Six pre-existing
`eslint-disable-next-line` comments that prettier broke by wrapping
their target tags onto multiple lines were widened to
`eslint-disable` / `eslint-enable` blocks.

Refs: postguard-website#117, encryption4all/cryptify#136,
encryption4all/cryptify#145, encryption4all/postguard-js#47.
